(54) Automatic search of audio channels by matching viewer-spoken words against closed-
caption text or audio content for interactive television

(57) A method and apparatus are provided to enable
a user watching and/or listening to a program to search
for new information in a stream of telecommunications
data. The apparatus includes a voice recognition
system that recognizes the user's request and causes a
search to be performed in the long stream of data of at
least one other telecommunication channel. The system
includes a storage device for storing and processing the
request. Upon recognition of the request, the incoming
signal or signals are scanned for matches with the
request. Upon finding a match between the request
and the incoming signal, information related to the data
is brought to the viewer's attention. This can be accomplished
by either changing the viewer's station or by
bringing a split-screen display forward into the display.
Description

Background and Summary of the Invention

[0001] The present invention relates generally to
interactive television and more particularly, to a system 
that allows the user to select channels by spoken 
request. 

[0002] Interactive television promises to allow two- 
way communication between the viewer and his or her 
television set. Although the technology is still in its 
infancy, digital television is expected to greatly enrich 
the prospects for interactive TV, because the digital 
technology makes possible a far more efficient use of
available channel bandwidth. Through digital technology,
broadcasters can pack a significantly larger
number of programs into the available bandwidth of the
delivery infrastructure (e.g. cable or satellite).
[0003] While the new interactive, digital television 
technology offers a significant number of benefits to 
both viewers and broadcasters, it is not without problems.
The prospect of having 200 or more channels
simultaneously available for viewing boggles the mind.
Conventional on-screen electronic program guides are 
likely to prove inadequate in assisting viewers to find 
programs they are interested in. Interactive digital televi- 
sion demands a more sophisticated system of interac- 
tion if the viewers are ever going to be able to fully utilize 
this rich new resource. 

[0004] The present invention provides a speech-enabled
interactive system through which a user can
specify a desired program content through natural language
speech. The system extracts both keyword and
semantic content from the user's speech, prompting the
user to furnish additional information if the meaning is
unclear.

[0005] The system then monitors closed caption 
information on multiple channels simultaneously and 
switches the active channel tuner or auxiliary tuner to 
the channel carrying information matching the user's 
request. If closed caption information is not available, 
the system will alternatively employ speech recognition 
upon the audio signal of the channels being monitored. 
Once the channel has been switched, the program may 
be displayed in full screen mode, or in split-screen or 
picture-in-picture mode, or recorded for later viewing. 
[0006] The speech recognition system works with a 
semantic analyzer that is able to discriminate between 
speech intended to describe program content and 
speech intended to supply meta-commands to the sys- 
tem. By extracting meaning as well as keywords and 
phrases from the spoken input, the system will find 
matching content even when the spoken words do not 
match the closed caption text verbatim.
[0007] For a more complete understanding of the 
invention, its objects and advantages, refer to the follow- 
ing specification and to the accompanying drawings. 



Brief Description of the Drawings 
[0008]

Figure 1 is a block diagram of a presently preferred
embodiment of the invention.

Figure 2 is a data flow diagram illustrating the word
selector and semantic analyzer component of the
preferred embodiment.

Detailed Description of the Preferred Embodiments 

[0009] Referring to Figure 1, the interactive content
searching system of the invention may be integrated
into the television set 10, or into a set top box 12. In
either embodiment, the system is designed to monitor
one or more channels not currently being viewed, to
detect closed caption text or audio channel speech that
matches the user's previously spoken request. In Figure
1, a plurality of tuners has been illustrated, including an
active channel tuner 14 and a plurality of auxiliary tuners
16. In the illustrated embodiment it is assumed that
there are n auxiliary tuners (where n is an integer
number greater than 0). In its simplest form, the invention
may be implemented using a single auxiliary tuner.
[0010] The active channel tuner 14 is tuned to a
channel set by the user and this tuner thus selects the
channel the user is currently watching on television set
10. If desired, one or more of the auxiliary tuners may
also supply program content for viewing on television
set 10, such as in a split-screen mode or picture-in-picture
mode. In Figure 1, the auxiliary tuner, labeled tuner
n, is connected to supply program content to television
set 10.

[0011] Using current tuner technology, the active
channel tuner 14 and auxiliary tuners 16 select the
desired channel by selecting the corresponding frequency
band through bandpass filtering of the RF signal.
While tuners of this type may be employed to
implement the invention, other forms of digital "channel"
selection are also envisioned, whereby the desired program
content is extracted from the video data stream in
the digital domain. For purposes of implementing the
invention, the manner of channel selection depends
upon the manner in which the television signals are
encoded and broadcast.

[0012] Regardless of the form of the signals used to
broadcast program material, the auxiliary tuners 16 are
each set to monitor a different program channel, so that
the closed caption text information and audio signal may
be monitored by the system. The user selects which
channels to monitor, using either on-screen menu
selection or voiced meta-commands.
[0013] The system employs a speech recognizer 18
with which the user communicates through a suitable
microphone 20. Microphone 20 may be incorporated
into the television set or set top box; however, the presently
preferred embodiment incorporates the microphone
into a hand-held remote control unit 22, which
communicates with the television set or set top box by a
suitable link, such as an infrared or hard-wired link.
[0014] Speech recognizer 18 works in conjunction
with a set of speech models 24 representing all words
recognizable by the system. The speech recognizer
may be based on Hidden Markov Model (HMM) technology,
or other suitable model-based recognition technology.
The dictionary or lexicon of words recognizable by
the system may include not only words, but letters of the
alphabet, thereby allowing the system to recognize letters
spoken by the user in spelling other new words. As
will be more fully explained below, inclusion of speech
models for letters of the alphabet allows the user to train
the speech recognizer to learn new words even if a keyboard
is not available for typing.
[0015] Speech recognizer 18, in effect, converts
spoken utterances into text corresponding to the most 
probable word or phrase candidates (or letter candi- 
dates) recognized by the system. In the presently pre- 
ferred embodiment, speech recognizer 18 outputs the 
N-best sentence candidates for each sentence utter- 
ance spoken by the user. The recognizer generates a 
probability score for each sentence, indicative of the 
likelihood that the sentence corresponds to the spoken 
utterance. The top N candidates are selected and fed to 
the word selector and semantic analyzer block 26 for 
further processing. 
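
As a rough illustration of the N-best step described above, the following Python sketch ranks scored sentence hypotheses and keeps the top N. The `Hypothesis` type, the `n_best` function, and the example scores are assumptions made for illustration; the specification does not define a data format for the recognizer output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """One sentence candidate with its recognizer score (hypothetical type)."""
    text: str
    score: float  # higher means more likely to match the utterance


def n_best(hypotheses, n):
    """Return the n highest-scoring hypotheses, best first."""
    return sorted(hypotheses, key=lambda h: h.score, reverse=True)[:n]


# Made-up candidates and scores for a single spoken sentence.
candidates = [
    Hypothesis("I want to watch football", 0.62),
    Hypothesis("I want to watch foosball", 0.21),
    Hypothesis("I won't watch football", 0.12),
    Hypothesis("eye want two watch football", 0.05),
]
top = n_best(candidates, 2)
```

Here `top` holds the two most probable sentences, which would then be handed to the word selector and semantic analyzer for disambiguation.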

[0016] Word selector and semantic analyzer block
26 performs several functions. First, it resolves which of
the N-best recognition candidates were actually
intended by the user. Second, it analyzes the semantic
content of the user's entire utterance, to determine additional
information about the user's request that may not
be gleaned from the individual words themselves.
Third, the semantic analyzer also analyzes the user's
input to resolve recognition errors and to determine
whether the user's input speech represents description
of program content or represents meta-commands
intended as instructions to effect system operation.
[0017] The word selector and semantic analyzer 
uses a combined local parser and global parser to 
select the correct candidate from the N-best candidates 
and also to perform semantic analysis. The details of 
these parser components are described more fully 
below. The word selector and semantic analyzer works 
with a dialog manager 28 that helps resolve ambiguities 
by prompting the user to supply additional information to 
specify either the program content or the meta-command.

[0018] Dialog manager 28 can supply either text
prompts or voiced prompts. Text prompts are generated
as alphanumeric text that is suitably injected into the
video signal for on-screen display. Voiced prompts are
supplied by a speech synthesizer within the dialog manager
and may be injected into the audio stream for
replay through the television speaker system.
[0019] If desired, a word history data store 30 may
be provided to store a record of previously resolved
word ambiguities, allowing the system to "learn" the
user's viewing habits, thereby assisting the word selector
in resolving subsequent word recognition ambiguities.

[0020] The word selector and semantic analyzer is
designed to extract the meaning behind the user's
request for a channel selection and it will automatically
select applicable synonyms to improve the text matching
process. Thus, if the word selector and semantic
analyzer determines that the user is interested in watching
a football game, synonyms and related words, such
as "touchdown," "kick-off," "NFL," "Superbowl," and the
like are extracted from the word selector's synonym
database 32.
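
The synonym expansion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `SYNONYM_DB` is a hypothetical in-memory stand-in for synonym database 32, and `build_word_list` is an invented helper name; the specification does not say how the database is organized.

```python
# Hypothetical stand-in for synonym database 32: topic -> related words.
SYNONYM_DB = {
    "football": ["touchdown", "kick-off", "NFL", "Superbowl"],
}


def build_word_list(keyword, synonym_db=SYNONYM_DB):
    """Return the spoken keyword followed by its related words,
    lowercased and de-duplicated, in their original order."""
    words = [keyword] + synonym_db.get(keyword.lower(), [])
    seen = set()
    result = []
    for word in words:
        key = word.lower()
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result


word_list = build_word_list("football")
```

A keyword with no database entry simply yields a one-word list, so the search still proceeds on the word the user actually spoke.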

[0021] The extracted words along with the user's
originally spoken word are then sent to a word list buffer
34 that serves as a dynamic dictionary for the text
matching processor 36. Text matching processor 36
receives individual streams of closed caption text data
and/or audio data from the auxiliary tuners 16 as that
information is broadcast live and selected by the
respective tuners. If audio data is supplied by an auxiliary
tuner, text matching processor 36 employs the services
of speech recognizer 18 to convert the audio
stream into text data.

[0022] Text matching processor 36 compares each
of the incoming text streams from the auxiliary tuners 16
with the words contained in word list buffer 34. If a
match is detected, processor 36 signals the channel
switcher 38, which, in turn, triggers a number of different
actions, depending upon the mode set by the user.
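
A minimal Python sketch of the comparison step follows. It makes several simplifying assumptions that are not in the specification: caption text is supplied as stored lines keyed by channel number rather than as live streams, channels are scanned one after another rather than in parallel, and only single-word matches are detected. The function name `find_matching_channel` is hypothetical.

```python
import re


def find_matching_channel(caption_streams, word_list):
    """Return (channel, word) for the first caption line containing a
    listed word, or None if nothing matches. `caption_streams` maps a
    channel id to an iterable of caption lines (assumed input format)."""
    wanted = {w.lower() for w in word_list}
    for channel, lines in caption_streams.items():
        for line in lines:
            # Split the caption line into lowercase word tokens.
            for token in re.findall(r"[a-z0-9'-]+", line.lower()):
                if token in wanted:
                    return channel, token
    return None


streams = {
    5: ["Welcome back to the cooking show."],
    9: ["And that's a touchdown for Green Bay!"],
}
match = find_matching_channel(streams, ["football", "touchdown", "NFL"])
```

The returned channel number is what a channel switcher would act on; multi-word entries such as team names would need phrase matching, which this sketch leaves out.
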
[0023] In a first mode, channel switcher 38 sends a
command to the active channel tuner 14, causing the
active channel tuner to immediately switch to the channel
on which the detected word match occurred. The
user is thus immediately switched to the channel containing
the content he or she previously requested.
[0024] In a second mode, channel switcher 38
switches one of the auxiliary tuners (such as tuner n) to
the channel that triggered the word match. In this mode,
the viewer continues to watch the active channel, but is
also presented with a picture-in-picture or a split screen
view of the other channel detected.

[0025] In a third mode, the channel switcher activates
a recorder 40, such as a DVD recorder, that will
record the program on the tuner that triggered the word
match. This mode allows the viewer to continue watching
the active channel, while the system records the
other selected channel for later viewing.
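
The three modes can be summarized as a small dispatch routine. The Python sketch below is illustrative only: the `Mode` enumeration, the `on_match` function, and the command tuples it returns are assumptions, since the specification does not define a command format between the channel switcher and the tuners or recorder.

```python
from enum import Enum


class Mode(Enum):
    SWITCH_ACTIVE = 1        # first mode: retune the active channel tuner
    PICTURE_IN_PICTURE = 2   # second mode: retune an auxiliary tuner
    RECORD = 3               # third mode: start the recorder


def on_match(mode, channel):
    """Return a hypothetical (target, action, channel) command for a match."""
    if mode is Mode.SWITCH_ACTIVE:
        return ("active_tuner", "tune", channel)
    if mode is Mode.PICTURE_IN_PICTURE:
        return ("auxiliary_tuner", "tune_and_show_inset", channel)
    if mode is Mode.RECORD:
        return ("recorder", "record", channel)
    raise ValueError(f"unknown mode: {mode!r}")


command = on_match(Mode.PICTURE_IN_PICTURE, 9)
```

Only the first mode interrupts what the viewer is watching; the other two leave the active channel untouched.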

[0026] The speech recognizer that forms the heart
of the word recognition system of the invention is preferably
provided with a set of speech models 24 representing
speaker independent word and letter templates for
the most popular words used to describe program content.
However, to give the system added flexibility, a
model training processor 42 may be provided to allow
an individual user to add words to the speech model dictionary.
The model training processor 42 takes as its
input two pieces of information: (a) speech information
corresponding to new words the user wishes to add to
the dictionary and (b) text information representing the
spelling of those new words. Speech information is provided
via microphone 20, in the same fashion as speech
information is provided to recognizer 18. Text information
may be provided via a keyboard 44 or other suitable
text entry device, including an on-screen text entry system
employing the keypad buttons of the remote control 22.
[0027] As an alternate means of inputting text information,
the speech recognizer 18 may be used. In this
alternate mode, the speaker both speaks the new word
and then spells it, by speaking into microphone 20.
Speech recognizer 18 uses its speech models of
spelled letters to interpret the spelled word input and
correlate that with the spoken utterance representing
the word itself. The model training processor 42 then
constructs speech models using the same model
parameters upon which the initially supplied speech
models are based.

[0028] The word selector and semantic analyzer 26 
performs the important function of making sense of the 
user's natural language spoken input. The task of the
word selector and semantic analyzer is thus more com- 
plex than merely spotting keywords within a stream of 
speech recognized text. The analyzer extracts not only 
the important keywords but also the context of those 
words, so that the semantic content or meaning of the 
spoken input can be determined. The word selector and 
semantic analyzer employs a dual parser system for this 
purpose. That system is shown diagrammatically in Fig- 
ure 2. 

[0029] Referring to Figure 2, the analyzer maintains 
a frame data store 50 in which a plurality of task-based 
frames or templates are stored. The data structure of 
these templates is illustrated diagrammatically at 52.
Each frame comprises a plurality of slots 54 into which 
extracted keywords are placed as the word selector and 
semantic analyzer operates. 
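
One way to picture a task-based frame with slots is the Python sketch below. It is a minimal sketch under assumptions: the `Frame` class, its methods, and the particular slot names are hypothetical and are not taken from the template structure shown at 52.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    """A task-based template whose named slots start out empty."""
    name: str
    slot_names: tuple
    slots: dict = field(default_factory=dict)

    def fill(self, slot, value):
        """Place an extracted keyword into one of the frame's slots."""
        if slot not in self.slot_names:
            raise KeyError(f"frame {self.name!r} has no slot {slot!r}")
        self.slots[slot] = value

    def is_complete(self):
        """True once every slot of the frame holds a value."""
        return all(s in self.slots for s in self.slot_names)


# Hypothetical frame for a sports viewing request.
sports = Frame("sports", ("sports type", "timeframe"))
sports.fill("sports type", "football")
```

An incomplete frame is a natural trigger for a clarifying prompt, since the missing slot names say exactly what information is still needed.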

[0030] A local parser 56, based on an LR grammar
58, parses the text stream 60 supplied by the speech
recognizer 18 (Fig. 1). The LR grammar allows the local
parser to detect and label sentence fragments within the
text stream that contain important keywords used to
select words for filling the word list buffer 34 (Fig. 1). For
example, local parser 56 contains an LR grammar to
extract the keyword "football" from the following sentence:

"I think I would like to watch a football game tonight."

[0031] Using its LR grammar, the local parser 
decodes the above sentence by examining the structure 
of the sentence and determines that the object of the 
sentence is "football game" and that the user has also
specified a timeframe parameter, namely "tonight". 
[0032] Local parser 56 then accesses a data store
of keyword tags 62 to extract meaning from the keywords
and phrases. The keyword tags data store may
be structured to give a frame tag and slot tag identifier
for each phrase or keyword. The keyword "football"
might have a frame tag of "sports" and a slot tag of
"sports type." These keyword tags allow the local parser
to determine which frame within data store 52 to use
and to which slot 54 the identified phrase or keyword
should be assigned.
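
The keyword tag lookup can be illustrated with the short Python sketch below. Note what it does not do: it uses plain word spotting rather than an LR grammar, so it shows only the tag lookup and slot assignment, not the parsing. The `KEYWORD_TAGS` table is a hypothetical stand-in for the keyword tags data store 62, and the tag given to "tonight" is an assumption made for the example.

```python
# Hypothetical stand-in for keyword tags data store 62:
# keyword -> (frame tag, slot tag).
KEYWORD_TAGS = {
    "football": ("sports", "sports type"),
    "tonight": ("sports", "timeframe"),  # assumed tag, for illustration
}


def assign_keywords(sentence, keyword_tags=KEYWORD_TAGS):
    """Return {frame tag: {slot tag: keyword}} for tagged words found."""
    frames = {}
    for raw in sentence.lower().split():
        word = raw.strip('.,!?"')  # drop surrounding punctuation
        if word in keyword_tags:
            frame_tag, slot_tag = keyword_tags[word]
            frames.setdefault(frame_tag, {})[slot_tag] = word
    return frames


filled = assign_keywords(
    "I think I would like to watch a football game tonight."
)
```

Words without a tag, such as "think" or "game", are simply ignored, which is why the tag table rather than the sentence determines which frame is selected.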

[0033] Each of the frames within frame data store
50 is goal-oriented. That is, each frame corresponds to
a different media content selection task or system operation
task. The range of tasks can be as varied as the
user wishes. In a typical embodiment suitable for consumer
applications, the system may be provided with a
predefined set of frames corresponding to each of the
available system operation commands and to a variety
of typical program content requests. The user could
thus speak into the system to perform a system command,
such as instructing the system to record an identified
program instead of displaying it through the active
channel tuner. A user command such as:

"I want to record the Seinfeld re-run tomorrow night."

would cause the system to enter a record mode. The
above command would also be parsed by the local
parser to identify the user's requested program content,
namely the Seinfeld re-run.
[0034] Similarly, the user could utter:

"I want to watch Seinfeld now."

[0035] This would cause the system to immediately
switch channels to the one carrying the Seinfeld broadcast.

[0036] In some instances, the LR grammar of the
local parser may not be sufficient to resolve the user's
input without ambiguity. This will occur where the local
parser identifies sentence fragments that, taken out of
context, may have several meanings. For example, the
following input:

"I want to watch Seinfeld and record it"

presents the following ambiguity. The local parser may
determine with equal validity that the program
requested by the user is either (a) "Seinfeld" or (b)
"Seinfeld And Record It."

[0037] To resolve such ambiguities, the system
includes a second parser, the global parser 70. The global
parser 70 also monitors the text stream as well as
receiving input from the local parser 56. The global
parser has a set of decision trees 72 that it uses to
resolve ambiguities such as the one illustrated above.
More specifically, global parser 70 has a set of decision
trees 72, one decision tree for each meaning. Each
decision tree is also in charge of solving ambiguities in
the meaning represented. Each decision tree is a binary
tree structure in which the root node and intermediate
nodes each contain a question that may be answered
either YES or NO. Answering a given question branches
left or right to a successively lower node, depending on
whether the answer was YES or NO. The final nodes or
leaf nodes contain the determination of the meaning
that has been expressed. The system uses this decision
information to resolve ambiguities in selecting the
proper frame from frame data store 50 and in assigning
keywords to the proper slots.
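
The binary YES/NO tree structure described above can be sketched in Python as follows. This is a minimal sketch, not the parser's actual trees: the `Node` type, the `decide` function, the two questions, and the leaf meanings are all invented for illustration, and a single tree is shown where the description calls for one tree per meaning.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Node:
    """Internal nodes hold a yes/no question; leaves hold a meaning."""
    question: Optional[Callable[[str], bool]] = None
    yes: Optional["Node"] = None
    no: Optional["Node"] = None
    meaning: Optional[str] = None


def decide(node, utterance):
    """Walk from the root to a leaf, branching on each answer."""
    while node.meaning is None:
        node = node.yes if node.question(utterance) else node.no
    return node.meaning


# Hypothetical tree for the "Seinfeld and record it" ambiguity.
tree = Node(
    question=lambda u: " and record " in u.lower(),
    yes=Node(
        question=lambda u: u.lower().rstrip(" .").endswith(" it"),
        yes=Node(meaning="watch program, then record it"),
        no=Node(meaning="program title contains 'and record'"),
    ),
    no=Node(meaning="watch program"),
)

result = decide(tree, "I want to watch Seinfeld and record it")
```

Each leaf stands for one resolved reading of the input, which is the information needed to pick a frame and assign keywords to its slots.
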
[0038] After the frame data store has been populated
by the local and global parsers, the word selector
module 74 accesses the data store 50 to obtain the
applicable list of keywords for sending to the word list
buffer 34. The selector module may employ the services
of an electronic thesaurus 76 to generate synonyms or
additional words to enrich the keyword list supplied to
the word list buffer. The word list selector module might,
for example, extract the word "football" from frame data
store 50 and obtain additional words such as "touchdown,"
"Green Bay Packers," or "NFL" from the thesaurus
76. In this regard, note that the additional words
selected need not necessarily constitute synonyms in
the dictionary sense. Rather, they may constitute additional
words or related words that are often found in natural
language speech involving the subject of the user-specified
keyword.

[0039] From the foregoing, it will be appreciated
that the automatic search mechanism of the invention
greatly eases the task of identifying program material in
a television system having access to many channels of
information. While the invention has been described in
its presently preferred embodiment, it will be understood
that the invention is capable of modification and
change without departing from the spirit of the invention
as set forth in the appended claims.

Claims 

1. A system receiving input from a telecommunication
infrastructure and displaying the information on a
display, the input signal having a plurality of information
components, said system comprising:

a speech recognizer for receiving user spoken
request from a user and producing a first output;

a semantic analyzer for processing the first output
to produce a word list;

and a text pattern matcher for comparing the
word list with the plurality of information components.

2. The system of claim 1 further comprising a plurality
of digital tuners for parsing the input signal into the
information components.

3. The system of claim 2 wherein the speech recognizer
further contains a plurality of speech models,
each model representing either a sub-word unit or a
letter template.

4. The system of claim 3 wherein the semantic analyzer
contains a natural language analyzer which
recognizes at least one of: synonyms, spelled
words, system commands.

5. The system of claim 4 wherein the speech recog- 
nizer provides a plurality of likely requests and the 
word selector determines which request in the word 
list will be used in the search. 

6. The system of claim 5 wherein the semantic ana- 
lyzer stores historical information or past searches 
and uses historical information in its determination 
of which words from the word list will be searched. 

7. The system of claim 5 wherein the word selector 
semantic analyzer contains a local parser and a 
global parser.

8. The system of claim 5 wherein the word selector 
semantic analyzer provides synonyms of search 
terms to the text pattern matcher. 

9. The system of claim 5 wherein the text pattern 
matcher compares the word list to the data from the
plurality of tuners. 